knitr::include_graphics("ioc-7-extending.pdf")
knitr::include_graphics("ioc-7-theory.pdf")
The main purpose of the exercises below is to recreate regressions presented earlier on the slides, using the freedom house data set.
dat <- read.csv("fh_2015.csv")
The first two columns are Freedom House Political Rights score for listed countries, dated 2015;
head(dat[,1:2])
## Country Political.Rights.value
## 1 Afghanistan 6
## 2 Albania 3
## 3 Algeria 6
## 4 Andorra 1
## 5 Angola 6
## 6 Antigua & Barbuda 2
the next three are from the International Monetary Fund, GDP (expressed in PPP) values in international dollars;
head(dat[,3:5])
## Country.1 Units PPP
## 1 Afghanistan Current international dollar 1976.18
## 2 Albania Current international dollar 11885.06
## 3 Algeria Current international dollar 14470.55
## 4 NA
## 5 Angola Current international dollar 7373.04
## 6 Antigua and Barbuda Current international dollar 22965.79
last two are Freedom House ‘Freedom of the Press scores’ for 2015.
head(dat[,6:7])
## Country.2 FoP
## 1 Afghanistan 67
## 2 Albania 49
## 3 Algeria 61
## 4 Andorra 13
## 5 Angola 70
## 6 Antigua and Barbuda 38
Regress political rights scores on freedom of press scores for Afganistan, Colombia, and the United Kingdom.
Store results in m1 (stands for model 1).
# identify row numbers for these countries,
# add multiple via 'or' operator: | horizontal bar.
which(dat$Country=="Afghanistan"|dat$Country=="Colombia"|dat$Country=="United Kingdom")
## [1] 1 39 191
# formula always specified as y ~ x
m1 <- lm(dat$FoP[c(1,39,191)] ~ dat$Political.Rights.value[c(1,39,191)])
What is the degrees of freedom in model 1?
3-2
## [1] 1
where 3 is the number of observations, and 2 is the number of parameters in the model: an intercept, and a slope for ‘political rights’ score.
Next: calculate predicted values, \(\hat{Y_i}\) where \(i =\) Afghanistan, Colombia, and UK; of FoP for the three observed Political Rights value, using information from the model.
Hint: the equation is \(\hat{Y}=\beta_0 + \beta_1X\).
# the intercept, Beta0 is
coef(m1)[1]
## (Intercept)
## 21.21053
# the slope, Beta1 is
coef(m1)[2]
## dat$Political.Rights.value[c(1, 39, 191)]
## 8.236842
# thus using the observed Political rights values for each country, namely:
dat$Political.Rights.value[c(1, 39, 191)]
## [1] 6 3 1
# and the regression equation being Yhat = Beta0 + Beta1 * X, we calculate by hand
yhat_afgh <- 21.211 + 8.237*6
yhat_col <- 21.211 + 8.237*3
yhat_uk <- 21.211 + 8.237*1
c(yhat_uk,yhat_afgh,yhat_col)
## [1] 29.448 70.633 45.922
Next, plot the observations in model 1 as well as the regression line
plot(x=dat$Political.Rights.value[c(1,39,191)],
y=dat$FoP[c(1,39,191)],ylim=c(25,75),
ylab = "FoP",xlab="Political rights")
# with ylim I reset span of Y axis otherwise truncated;
# with xlab and ylab I rename axis labels. all this is optional
abline(m1,col="red")
We will use this plot to visualise residuals below.
Residuals mean deviation from the red line, \(\hat{Y_i}\), the individual predictions.
Let’s plot these distances by connecting the dots with the red line: the \((X,Y)\) coordinates with \((X,\hat{Y})\) coordinates.
Hint: ?lines(). You will need to give coordinates in format (X,X) (Y,Yhat) to denote a single line.
lines(c(1,1),c(dat$FoP[191],yhat_uk))
lines(c(3,3),c(dat$FoP[39],yhat_col))
lines(c(6,6),c(dat$FoP[1],yhat_afgh))
Now fit m1 and m2 where political rights scores are regressed on freedom of press scores, but m1 only uses observations for Afganistan, Colombia, and the United Kingdom whereas m2 uses all observations.
rowselect <- which(
dat$Country=="Afghanistan"|dat$Country=="Colombia"|dat$Country=="United Kingdom"
)
# formula always specified as y ~ x
m1 <- lm(dat$FoP[rowselect] ~ dat$Political.Rights.value[rowselect])
m2 <- lm(dat$FoP ~ dat$Political.Rights.value)
Let’s print the model summaries for reference later. In this summary, identify the estimates for the intercept and for the slope, we’ll need them later.
summary(m1) # the slope estimate is 8.24
##
## Call:
## lm(formula = dat$FoP[rowselect] ~ dat$Political.Rights.value[rowselect])
##
## Residuals:
## 1 2 3
## -3.632 9.079 -5.447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.211 12.315 1.722 0.335
## dat$Political.Rights.value[rowselect] 8.237 3.145 2.619 0.232
##
## Residual standard error: 11.19 on 1 degrees of freedom
## Multiple R-squared: 0.8728, Adjusted R-squared: 0.7455
## F-statistic: 6.859 on 1 and 1 DF, p-value: 0.2322
summary(m2) # the slope estimate is 10.02
##
## Call:
## lm(formula = dat$FoP ~ dat$Political.Rights.value)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.7341 -6.1948 0.2659 6.2954 23.3640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.6164 1.2856 11.37 <2e-16 ***
## dat$Political.Rights.value 10.0196 0.3232 31.00 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.687 on 193 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.8328, Adjusted R-squared: 0.8319
## F-statistic: 961 on 1 and 193 DF, p-value: < 2.2e-16
Next: inspect model \(R^2\)s, previously expressed as the ‘explained’ sum of squares (how much do your predicted values, the \(\hat{Y}\)s vary?) over the ‘total’ sum of squares (how much do your raw values, the \(Y\)s vary?); see Handout 6 for details.
Now let’s grab these from the model summaries:
# dollar sign takes this list item from a list of many other things,
# you can inspect all elements via str() such as via(summary(m1))
summary(m1)$r.squared
## [1] 0.8727595
summary(m2)$r.squared
## [1] 0.8327512
We interpreted this as information about model fit, the percentage of variation in freedom of press scores explained by the model namely political rights scores.
Let’s now turn to hypothesis testing.
First key statistic is the standard error of the slope. Read as variability around your estimate of the slope or uncertainty in your effect size. They are also in the model summary right next to the estimate, namely:
summary(m2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.61636 1.2856020 11.36928 2.937715e-23
## dat$Political.Rights.value 10.01962 0.3232185 30.99951 7.119023e-77
It’s not very difficult to hand-calculate this standard error of estimate, especially with few observations, let’s calculate it for m1 as \(se(\hat{\beta_1})=\sqrt{\frac{\hat{\sigma}^2}{\sum{(X_i-\bar{X})^2}}}\) where \(X\)s are the political rights scores, and \(\bar{X}\) is their mean. This comes from slide 15 within week 6 slides.
Reminder: sigma square in the numerator is the residual sum of squares: \(\hat{\sigma}^2=\frac{\sum{u^2_i}}{n-2}\), where the \(u_i\)’s are the residuals for each observation, i. You can find this equation on
# taking residuals straight from model object via residuals() function
sigma2 <- sum(residuals(m1)^2)/1
# we'll need the Xs,
dat$Political.Rights.value[rowselect]
## [1] 6 3 1
# and their mean Xbar,
mean(dat$Political.Rights.value[rowselect])
## [1] 3.333333
# in the denominator which is
denominator <- sum ( ( dat$Political.Rights.value[rowselect] -
mean(dat$Political.Rights.value[rowselect]) )^2 )
se.B <- sqrt(sigma2/denominator)
se.B
## [1] 3.14504
Next we look at the confidence bounds around the estimate (of the slope, in this case) which is very roughly estimate plus error for an upper bound, and estimate minus error for a lower bound, but this needs to be expressed in a standardised format so that we can talk about probabilities later.
The precise equation is equation 4 on slide 14 within week 6 slides. In a simplified form, upper bound is upr = estimate + se.B * qt(0.95,1) where the qt() function finds the t value in the t distribution (tables available online) at the 95th percentile or cutpoint, with 1 degrees of freedom. Likewise, the lower bound is lwr = estimate - se.B * qt(0.95,1)
# as.numeric(paste()) here just gets rid of the fluff around the numbers (labels)
estimate <- as.numeric(paste( coef(m1)[2] ))
upr <- as.numeric(paste( coef(m1)[2] + se.B*qt(0.95,1) ))
lwr <- as.numeric(paste( coef(m1)[2] - se.B*qt(0.95,1) ))
# list results within bounds
c(lwr,estimate,upr)
## [1] -11.620157 8.236842 28.093841
This is the 90% confidence interval ( sorry for improvised hand drawing)
confint(m1,level=.9)
## 5 % 95 %
## (Intercept) -56.54511 98.96617
## dat$Political.Rights.value[rowselect] -11.62016 28.09384